Introduction to Open Data Science - Course Project

About the project

2: Regression and model validation

Describe the work you have done this week and summarize your learning.

2.1 Data wrangling

After understanding the metadata file and skipping the unnecessary Finnish-language instructions, it was easy to prepare the data file. The script can be found here. I used plyr and tidyverse to convert the raw data into a workable file.

2.2 Analysis

The dataset is the result of a survey on teaching and learning that was conducted as a research project in 2014. The dataset is loaded in the next chunk, and its dimensions and structure are printed.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
learning2014_csv <- read_csv("/home/alex/ods_course/IODS-project_R/data/learning2014_AA.csv")
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(learning2014_csv)
## [1] 166   7
str(learning2014_csv)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

The data frame contains 166 rows and 7 columns. The first column encodes gender and has the character type. The rest are all numeric and contain the data that will be used for the regression analysis. The “attitude”, “deep”, “stra”, and “surf” columns contain combination scores computed from the original data frame. The full metadata explanation can be found here. The explanation of the study can be found here.

Briefly:

  • attitude is the global attitude toward statistics, a composite score of the questions concerning overall attitude toward statistics
  • deep is the deep approach, an adjusted composite score of the questions connected with deep understanding of statistics
  • stra is the strategic approach, an adjusted composite score of the questions connected with how strategically the participant approaches the subject
  • surf is the surface approach, an adjusted composite score of the questions connected with whether the participant struggles to understand the material deeply and has problems studying.
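
Such composite scores are typically formed as the mean of a group of related Likert-scale items. A minimal sketch of the idea (the actual item-to-score mapping is defined in the metadata; the item names and values below are hypothetical):

```r
# Hypothetical example: a composite score as the row-wise mean of
# its related Likert-scale items (one row per participant).
responses <- data.frame(
  D03 = c(4, 2, 5),   # hypothetical deep-approach items
  D11 = c(3, 3, 4),
  D19 = c(4, 2, 5)
)
deep_score <- rowMeans(responses)
deep_score   # one composite value per participant
```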

2.2.2 Data exploration

2.2.2.1 Summary for the variables

summaries <- learning2014_csv %>%
  summarise(across(where(is.numeric), list(mean = ~mean(.), sd = ~sd(.), min = ~min(.), max = ~max(.), median = ~median(.))))

print(t(summaries))
##                       [,1]
## age_mean        25.5120482
## age_sd           7.7660785
## age_min         17.0000000
## age_max         55.0000000
## age_median      22.0000000
## attitude_mean    3.1427711
## attitude_sd      0.7299079
## attitude_min     1.4000000
## attitude_max     5.0000000
## attitude_median  3.2000000
## deep_mean        3.6797189
## deep_sd          0.5541369
## deep_min         1.5833333
## deep_max         4.9166667
## deep_median      3.6666667
## stra_mean        3.1212349
## stra_sd          0.7718318
## stra_min         1.2500000
## stra_max         5.0000000
## stra_median      3.1875000
## surf_mean        2.7871486
## surf_sd          0.5288405
## surf_min         1.5833333
## surf_max         4.3333333
## surf_median      2.8333333
## points_mean     22.7168675
## points_sd        5.8948836
## points_min       7.0000000
## points_max      33.0000000
## points_median   23.0000000

2.2.2.2 Graphical data summary

ggpairs(learning2014_csv, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()

Description

  1. The gender distribution is uneven.
  2. The age distribution is also skewed toward younger participants, with only a few older participants of each gender.
  3. No other variable shows stark clustering behaviour.
  4. There are significant correlations between variables:
    • Negative:
      • surf/deep - these are expected to be inversely correlated, as are surf/attitude and surf/stra
      • age/surf appear to be negatively correlated, but the p-value is between 0.05 and 0.1, as it is for surf/points.
    • Positive:
      • attitude/points, stra/points

These results make sense at the first glance.

2.2.3 Linear regression

I have chosen three variables to check: attitude, age and surf.

linear_modelling_for_learning2014_csv  <- lm(points ~ surf + attitude + age, data = learning2014_csv) 
summary(linear_modelling_for_learning2014_csv)
## 
## Call:
## lm(formula = points ~ surf + attitude + age, data = learning2014_csv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.973  -3.430   0.167   3.997  10.444 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16.85628    3.53786   4.765 4.18e-06 ***
## surf        -0.96049    0.79943  -1.201    0.231    
## attitude     3.42388    0.57353   5.970 1.45e-08 ***
## age         -0.08713    0.05361  -1.625    0.106    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.294 on 162 degrees of freedom
## Multiple R-squared:  0.2082, Adjusted R-squared:  0.1935 
## F-statistic:  14.2 on 3 and 162 DF,  p-value: 2.921e-08

Only one explanatory variable showed a statistically significant connection to the points outcome, so only the attitude variable will be kept for the next modelling step.

linear_modelling_for_learning2014_csv_pruned  <-lm(points ~ attitude, data = learning2014_csv) 
summary(linear_modelling_for_learning2014_csv_pruned)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014_csv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The second model is simpler but has not lost much explanatory power:

  • The p-value is comparably low, much lower than 0.001, meaning that there is strong support for rejecting the hypothesis that the two variables have no connection.
  • The estimate shows how the two variables are connected: for a one-unit change in attitude, points change by 3.5255.

The adjusted R-squared of 0.1856 is relatively low, indicating that the model may not be very effective in predicting or explaining the dependent variable. This might be due to various reasons, such as missing important predictors, non-linear relationships, or high levels of noise in the data.
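
To make the estimate concrete, the fitted model can be applied by hand: plugging an attitude value into the reported coefficients reproduces what predict() would return. A sketch using the intercept and slope from the summary above (the attitude value is just an illustrative input):

```r
# Predicted exam points from the pruned model, computed by hand
# from the coefficients reported in the summary above.
intercept <- 11.6372
slope     <- 3.5255
attitude  <- 3.2                  # e.g. roughly the median attitude score
predicted_points <- intercept + slope * attitude
predicted_points                  # about 22.9 points
```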

2.2.3.1 Linear model diagnostic plots

plot(linear_modelling_for_learning2014_csv_pruned, which=c(1,2,5))

Using these plots we can investigate the assumptions of the model.

  1. We can use the “Residuals vs Fitted” plot to investigate the constant variance assumption. If the variance is constant, we should expect to see points distributed without noticeable structure, i.e. randomly. This is more or less what we see in our data. Although the points around 28 on the x axis seem to be bunched up, this may also be the result of low n-numbers, so we can assume constant variance.
  2. The Q-Q plot can be used to check whether the “normal distribution of residuals” assumption is met. The middle part of the plot follows the 45-degree line, meaning that the distribution is close to normal for the bulk of the data. The tails of the plot do deviate from the line, meaning that there is some deviation from perfect normality. At the bottom left there are 3 points that can be considered outliers. Further investigation is needed to determine whether these points should be excluded from further analysis.
  3. The Residuals vs Leverage plot shows whether the data contains influential outliers. Three points are highlighted: 71, 56 and 35. These should be investigated to determine if the data is fine; for example, there could be a data entry mistake, a missing value problem, or a true outlier. Points 56 and 35 are present on both the Q-Q and the Residuals vs Leverage plots, so the next step would be checking the data to see what happened to these points.

3: Logistic regression

Necessary packages

if (!require(tidyverse) == T) {
  install.packages("tidyverse")
}
library(tidyverse)


if (!require(gt) == T) {
  install.packages("gt")
}
## Loading required package: gt
library(gt)

3.1 Data wrangling

The dataframes were merged and modified according to the instructions. The resulting dataframe contains 370 observations with 35 columns. The R script used is here.
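
A toy sketch of the merge step (the actual script is linked above; the real join uses a longer list of shared background columns, so the column choice here is an assumption for illustration):

```r
# Toy sketch: join the math and Portuguese class records on shared
# background columns (hypothetical minimal example).
math <- data.frame(school = c("GP", "GP"), sex = c("F", "M"), G3_math = c(8, 11))
por  <- data.frame(school = c("GP", "GP"), sex = c("F", "M"), G3_por  = c(10, 12))

joined <- merge(math, por, by = c("school", "sex"))
joined        # one row per student, grades from both courses side by side
nrow(joined)  # 2
```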

3.2 Read the file and describe the data a bit

aa_alc <- read_csv("/home/alex/ods_course/IODS-project_R/data/aa_alc.csv")
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl  (1): high_use
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(aa_alc)
## spc_tbl_ [370 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ school     : chr [1:370] "GP" "GP" "GP" "GP" ...
##  $ sex        : chr [1:370] "F" "F" "F" "F" ...
##  $ age        : num [1:370] 18 17 15 15 16 16 16 17 15 15 ...
##  $ address    : chr [1:370] "U" "U" "U" "U" ...
##  $ famsize    : chr [1:370] "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus    : chr [1:370] "A" "T" "T" "T" ...
##  $ Medu       : num [1:370] 4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu       : num [1:370] 4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob       : chr [1:370] "at_home" "at_home" "at_home" "health" ...
##  $ Fjob       : chr [1:370] "teacher" "other" "other" "services" ...
##  $ reason     : chr [1:370] "course" "course" "other" "home" ...
##  $ guardian   : chr [1:370] "mother" "father" "mother" "mother" ...
##  $ traveltime : num [1:370] 2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime  : num [1:370] 2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup  : chr [1:370] "yes" "no" "yes" "no" ...
##  $ famsup     : chr [1:370] "no" "yes" "no" "yes" ...
##  $ activities : chr [1:370] "no" "no" "no" "yes" ...
##  $ nursery    : chr [1:370] "yes" "no" "yes" "yes" ...
##  $ higher     : chr [1:370] "yes" "yes" "yes" "yes" ...
##  $ internet   : chr [1:370] "no" "yes" "yes" "yes" ...
##  $ romantic   : chr [1:370] "no" "no" "no" "yes" ...
##  $ famrel     : num [1:370] 4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime   : num [1:370] 3 3 3 2 3 4 4 1 2 5 ...
##  $ goout      : num [1:370] 4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc       : num [1:370] 1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc       : num [1:370] 1 1 3 1 2 2 1 1 1 1 ...
##  $ health     : num [1:370] 3 3 3 5 5 5 3 1 1 5 ...
##  $ failures   : num [1:370] 0 0 2 0 0 0 0 0 0 0 ...
##  $ paid       : chr [1:370] "no" "no" "yes" "yes" ...
##  $ absences   : num [1:370] 5 3 8 1 2 8 0 4 0 0 ...
##  $ G1         : num [1:370] 2 7 10 14 8 14 12 8 16 13 ...
##  $ G2         : num [1:370] 8 8 10 14 12 14 12 9 17 14 ...
##  $ G3         : num [1:370] 8 8 11 14 12 14 12 10 18 14 ...
##  $ average_alc: num [1:370] 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use   : logi [1:370] FALSE FALSE TRUE FALSE FALSE FALSE ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   school = col_character(),
##   ..   sex = col_character(),
##   ..   age = col_double(),
##   ..   address = col_character(),
##   ..   famsize = col_character(),
##   ..   Pstatus = col_character(),
##   ..   Medu = col_double(),
##   ..   Fedu = col_double(),
##   ..   Mjob = col_character(),
##   ..   Fjob = col_character(),
##   ..   reason = col_character(),
##   ..   guardian = col_character(),
##   ..   traveltime = col_double(),
##   ..   studytime = col_double(),
##   ..   schoolsup = col_character(),
##   ..   famsup = col_character(),
##   ..   activities = col_character(),
##   ..   nursery = col_character(),
##   ..   higher = col_character(),
##   ..   internet = col_character(),
##   ..   romantic = col_character(),
##   ..   famrel = col_double(),
##   ..   freetime = col_double(),
##   ..   goout = col_double(),
##   ..   Dalc = col_double(),
##   ..   Walc = col_double(),
##   ..   health = col_double(),
##   ..   failures = col_double(),
##   ..   paid = col_character(),
##   ..   absences = col_double(),
##   ..   G1 = col_double(),
##   ..   G2 = col_double(),
##   ..   G3 = col_double(),
##   ..   average_alc = col_double(),
##   ..   high_use = col_logical()
##   .. )
##  - attr(*, "problems")=<externalptr>
colnames(aa_alc)
##  [1] "school"      "sex"         "age"         "address"     "famsize"    
##  [6] "Pstatus"     "Medu"        "Fedu"        "Mjob"        "Fjob"       
## [11] "reason"      "guardian"    "traveltime"  "studytime"   "schoolsup"  
## [16] "famsup"      "activities"  "nursery"     "higher"      "internet"   
## [21] "romantic"    "famrel"      "freetime"    "goout"       "Dalc"       
## [26] "Walc"        "health"      "failures"    "paid"        "absences"   
## [31] "G1"          "G2"          "G3"          "average_alc" "high_use"

The data show student achievement in secondary education at two Portuguese schools; the data contain 370 observations of 35 variables. The attributes include student grades and demographic, social and school-related features, and the data were collected using school reports and questionnaires. The different features can be used to analyse the dataset and make predictions. More detailed information can be found here.

  1. “school”: The name or type of school the student attends.

  2. “sex”: The student’s gender.

  3. “age”: The student’s age.

  4. “address”: The student’s home address or type of locality (urban/rural).

  5. “famsize”: The size of the student’s family.

  6. “Pstatus”: The marital status of the student’s parents.

  7. “Medu”: The mother’s education level.

  8. “Fedu”: The father’s education level.

  9. “Mjob”: The mother’s job.

  10. “Fjob”: The father’s job.

  11. “reason”: The reason for choosing this school.

  12. “guardian”: The student’s primary guardian.

  13. “traveltime”: Time taken to travel to school.

  14. “studytime”: Time spent on studying outside of school.

  15. “schoolsup”: Whether the student receives school support.

  16. “famsup”: Whether the student receives family support.

  17. “activities”: Participation in extracurricular activities.

  18. “nursery”: Whether the student attended nursery school.

  19. “higher”: Aspirations for higher education.

  20. “internet”: Access to the internet at home.

  21. “romantic”: Involvement in a romantic relationship.

  22. “famrel”: Quality of family relationships.

  23. “freetime”: Amount of free time after school.

  24. “goout”: Frequency of going out with friends.

  25. “Dalc”: Daily alcohol consumption.

  26. “Walc”: Weekly alcohol consumption.

  27. “health”: General health status.

  28. “failures”: Number of past class failures.

  29. “paid”: Whether the student is enrolled in extra paid classes.

  30. “absences”: Number of school absences.

  31. “G1”: Grade in the first period.

  32. “G2”: Grade in the second period.

  33. “G3”: Final grade.

  34. “average_alc”: Average alcohol consumption (perhaps a computed variable from Dalc and Walc).

  35. “high_use”: Indicator of high alcohol use (likely a derived variable).
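
The last two variables were created in the wrangling step. A minimal sketch of how they are likely derived, assuming the course convention of averaging the workday and weekend consumption and flagging averages above 2 (base R used here for brevity; the actual script is linked above):

```r
# Toy data standing in for the merged student dataset
students <- data.frame(Dalc = c(1, 2, 4),   # workday alcohol use, scale 1-5
                       Walc = c(1, 3, 5))   # weekend alcohol use, scale 1-5

# average_alc: mean of workday and weekend use
students$average_alc <- (students$Dalc + students$Walc) / 2
# high_use: TRUE when the average exceeds the assumed cutoff of 2
students$high_use <- students$average_alc > 2
students
```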

3.3 Choosing the variable

I decided to look at “failures”, “absences”, “freetime” and “famrel”. I would expect that:

  1. Failures might be positively associated with higher consumption.
  2. Absences might also be positively associated with higher consumption.
  3. Free time may be associated with weekly consumption, but it is possible that the association will be inverse.
  4. Family relations may be inversely correlated.

3.4 Testing the assumptions:

summaries <- aa_alc %>% group_by(high_use) %>% select(.,c("failures", "absences", "freetime", "famrel","high_use")) %>% 
  summarise(across(where(is.numeric), list(Average = ~mean(.), sd = ~sd(.))))
gt(summaries) %>% cols_align("left") %>% opt_stylize(color="gray", style=3)  %>% tab_header("Table for chosen variables")
Table for chosen variables

high_use  failures_Average  failures_sd  absences_Average  absences_sd  freetime_Average  freetime_sd  famrel_Average  famrel_sd
FALSE     0.1196911         0.4370868    3.710425          4.464017     3.119691          0.9869096    4.007722        0.8935266
TRUE      0.3513514         0.7464906    6.378378          7.060845     3.468468          0.9421428    3.765766        0.9337603

These are the averages and standard deviations for all the variables, to better see which variables are worth a closer look.

First, looking at the overall structure to see whether something connects the chosen variables.

ggpairs(select(aa_alc,c("failures", "absences", "freetime", "famrel","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()

As it turns out, free time is strongly correlated with family relations, which I did not expect.

Failures

ggpairs(select(aa_alc,c("failures","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()

The number of failures is slightly higher in the high-consumption group, but not by much. The overwhelming majority of participants in each group have 0 failures. This was expected.

Absences

ggpairs(select(aa_alc,c("absences","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()

The number of absences in the high use groups is higher, as expected. The standard deviation is also much higher in this group, suggesting more variability in the absences.

Freetime

ggpairs(select(aa_alc,c("freetime","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()

The freetime variable has different means but very similar sds between the groups: the high-consumption group has a slightly higher average. I was curious about this relation, and the result suggests a weak positive association.

Family relation

ggpairs(select(aa_alc,c("famrel","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()

The family relation variable seems to be lower in the high consumption group, as expected.

3.5 Using logistic regression

first_log_regression <- glm(high_use ~ failures + absences + freetime + famrel, data = aa_alc, family = "binomial")

# create model summary and extract the coeffs
summary(first_log_regression)
## 
## Call:
## glm(formula = high_use ~ failures + absences + freetime + famrel, 
##     family = "binomial", data = aa_alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.55560    0.63703  -2.442 0.014608 *  
## failures     0.58297    0.20259   2.878 0.004007 ** 
## absences     0.08375    0.02230   3.755 0.000173 ***
## freetime     0.43091    0.12804   3.365 0.000764 ***
## famrel      -0.31981    0.13098  -2.442 0.014622 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 408.17  on 365  degrees of freedom
## AIC: 418.17
## 
## Number of Fisher Scoring iterations: 4

From the model we can see the following:

  • Higher values of the failures, absences and freetime predictors are associated with a higher likelihood of high_use being TRUE; for family relations the opposite is true.
  • All predictors appear to be statistically significant, as indicated by their p-values and significance codes.
  • The reduction in deviance from null to residual suggests that the model with predictors fits better than the null model, i.e. the one without any predictor variables.
coeffs <- summary(first_log_regression) %>% coefficients()
coeffs
##                Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -1.55560199 0.63703172 -2.441954 0.0146080178
## failures     0.58296988 0.20258843  2.877607 0.0040070412
## absences     0.08375499 0.02230317  3.755295 0.0001731376
## freetime     0.43090906 0.12804024  3.365419 0.0007642749
## famrel      -0.31981102 0.13098388 -2.441606 0.0146221015
OddsRatio <- exp(coeffs)
confidence <- confint(first_log_regression)
## Waiting for profiling to be done...
result <- cbind(OddsRatio, confidence)
result
##              Estimate Std. Error     z value Pr(>|z|)      2.5 %      97.5 %
## (Intercept) 0.2110623   1.890860  0.08699073 1.014715 -2.8317664 -0.32620631
## failures    1.7913506   1.224568 17.77169287 1.004015  0.1917587  0.99258448
## absences    1.0873624   1.022554 42.74681512 1.000173  0.0420769  0.13014453
## freetime    1.5386556   1.136599 28.94562433 1.000765  0.1840339  0.68715409
## famrel      0.7262863   1.139949  0.08702100 1.014730 -0.5789180 -0.06341917

Here we can see the same: the first three predictors have a positive association with consumption, while the family relation variable has a negative association. The strongest connection is with failures, then freetime and famrel, while the link to absences is weak.

It is important to keep in mind that this is logistic regression, not the previously investigated linear regression. That means the estimates in this table represent odds ratios and should be thought of in terms of odds. For example, in our case, a one-unit increase in failures multiplies the odds of high alcohol consumption by about 1.8.
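
One caveat on the table above: exponentiating the whole coefficient matrix also exponentiates the standard errors, z values and p-values, which are meaningless on the odds-ratio scale; only the estimates and the confidence limits should be exponentiated. A small sketch using the failures values reported in the output above:

```r
# Odds ratio for failures, from the log-odds estimate reported above
beta_failures <- 0.58296988
or_failures <- exp(beta_failures)
or_failures   # about 1.79: each additional failure multiplies the odds
              # of high_use by roughly 1.8

# For a CI on the odds-ratio scale, exponentiate the log-odds limits
# (values taken from the confint() output above)
exp(c(0.1917587, 0.99258448))   # roughly 1.21 to 2.70
```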

The results agree with the hypotheses I started with. Interestingly, there seems to be a connection between free time and consumption levels, which was not obvious when just looking at the data.

3.6 Exploring the predictive power

We can test the predictive power of the model we created earlier to see if it can actually predict alcohol consumption based on the four chosen variables. We can predict, for each student, the probability of increased consumption.

aa_alc$predicted_probabilities <- predict(first_log_regression, type = "response")  

aa_alc <- mutate(aa_alc,prediction  = predicted_probabilities > 0.5) 

table(Actual = aa_alc$high_use, Predicted = aa_alc$prediction)
##        Predicted
## Actual  FALSE TRUE
##   FALSE   236   23
##   TRUE     85   26

These are the 2x2 cross tabulation count results.

((table(Actual_percentage = aa_alc$high_use, Predicted_percentage = aa_alc$prediction))/length(aa_alc$prediction)) * 100
##                  Predicted_percentage
## Actual_percentage     FALSE      TRUE
##             FALSE 63.783784  6.216216
##             TRUE  22.972973  7.027027

The FALSE outcome was correctly predicted for 236 participants (64%), and TRUE was correctly predicted for 26 (7%).
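
From these counts, standard classification metrics can also be computed by hand; a sketch using the cell counts from the cross tabulation above:

```r
# Cell counts from the cross tabulation above
TN <- 236; FP <- 23; FN <- 85; TP <- 26
n  <- TN + FP + FN + TP                  # 370 students

accuracy    <- (TP + TN) / n             # share of all predictions correct
sensitivity <- TP / (TP + FN)            # share of high users caught
specificity <- TN / (TN + FP)            # share of low users caught
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)   # 0.708, 0.234, 0.911
```

The model is good at recognizing low users but catches only about 23% of the high users.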

library(ggplot2)
ggplot(aa_alc, aes(x=high_use, y=prediction)) +
  geom_jitter(color="blue") +
  theme_minimal() +
  labs(title="Actual vs Predicted Alcohol Consumption")

confusion_matrix <- table(Predicted = aa_alc$prediction,Actual = aa_alc$high_use)
training_error <- 1 - sum(diag(confusion_matrix)) / sum(confusion_matrix)
training_error
## [1] 0.2918919

The total training error was 29%.

3.6.2 Comparing to simple guessing strategy

We can compare our model to a simple guessing strategy: always predicting the most common class.

First we determine which is the bigger group:

sum(aa_alc$high_use)/length(aa_alc$high_use)
## [1] 0.3

About 30% of participants are high consumers, so the more prevalent group is the low-consuming one.

simple_guess_accuracy <- mean(aa_alc$high_use == F)
model_accuracy <- mean(aa_alc$high_use == aa_alc$prediction)
simple_guess_accuracy
## [1] 0.7
model_accuracy
## [1] 0.7081081

The accuracy of the model is marginally better than guessing.

3.7 Cross-validating the model

We can perform a cross-validation of the model, meaning that we will test the model's performance on subsets of the same data to determine how accurately the chosen model can work in practice. For that we will use the cv.glm function from the boot library. The idea is to test how well the model predicts the status using ten different subsets of the data in sequence.

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
library(boot)

cv <- cv.glm(data = aa_alc, cost = loss_func, glmfit = first_log_regression, K = 10)

cv$delta[1]
## [1] 0.2945946

The cross-validation error of my model is about 0.29, compared to the 0.26 of the reference model, which is worse. I used 4 predictors, and I did not include the sex predictor in my model, which might be important. Additionally, two of my chosen factors were quite strongly correlated, so one of them might not add much to the model.
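
To see what the loss function passed to cv.glm is doing, here it is (re-defined as in the chunk above) applied to a toy case where one of three predicted probabilities lands on the wrong side of 0.5:

```r
# Mean prediction error: the share of cases where the predicted
# probability is more than 0.5 away from the true class (0 or 1),
# i.e. falls on the wrong side of the 0.5 threshold.
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Toy example: three observations, one misclassified
loss_func(class = c(1, 0, 1), prob = c(0.9, 0.2, 0.3))  # 1/3
```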

3.8 Exploring the number of predictors

We can try to use all the predictors to see how the number of predictors influences the model's performance. For this we exclude all the predictors that were used to create the outcome variable.

First we load additional libraries.

if (!require(caret) == T) {
  install.packages("caret")
}
## Loading required package: caret
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
## 
##     melanoma
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(caret)

if (!require(glmnet) == T) {
  install.packages("glmnet")
}
## Loading required package: glmnet
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-8
library(glmnet)
predictors <- setdiff(colnames(aa_alc), c("Dalc","Walc","average_alc","high_use","predicted_probabilities","prediction" ))
# Initialize vectors to store errors
training_errors <- c()
testing_errors <- c()
number_of_predictors <- c()
model_summaries <- list()

aa_alc$high_use <- as.factor(aa_alc$high_use)

# Loop over models with decreasing number of predictors
for (i in length(predictors):1) {
    # Select a subset of predictors
    current_predictors <- predictors[1:i]
    formula <- as.formula(paste("high_use ~", paste(current_predictors, collapse = "+")))
    set.seed(123)  # for reproducibility
    fitControl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
    model <- train(formula, data = aa_alc, method = "glm", family = "binomial", trControl = fitControl)
    
    # Store training and testing errors
    training_errors <- c(training_errors, mean(model$results$Accuracy))
    testing_errors <- c(testing_errors, 1 - max(model$results$Accuracy))# Replace 'Accuracy' with the appropriate metric
    number_of_predictors <- c(number_of_predictors, i)
    model_summaries[[i]] <- model$coefnames
}

results_df <- data.frame(
    Number_of_Predictors = number_of_predictors,
    Training_Error = training_errors,
    Testing_Error = testing_errors
)

# Plotting
ggplot(results_df, aes(x = Number_of_Predictors)) +
    geom_line(aes(y = Training_Error, colour = "Training Error")) +
    geom_line(aes(y = Testing_Error, colour = "Testing Error")) +
    labs(title = "Training and Testing Errors by Number of Predictors",
         x = "Number of Predictors",
         y = "Error") +
    theme_minimal()

This shows that increasing the number of predictors makes the model worse; however, there is an increase in model performance at the 26th point of the plot. If we look at the predictor first present at the 26th point:

setdiff(model_summaries[[26]],model_summaries[[25]])
## [1] "failures"

As we know, failures was a strong predictor. So, to check how the plot looks if we start with the better predictors, I reversed the predictors vector.

predictors  <- rev(predictors)
# Initialize vectors to store errors
training_errors <- c()
testing_errors <- c()
number_of_predictors <- c()
model_summaries <- list()

# Loop over models with decreasing number of predictors
for (i in length(predictors):1) {
    # Select a subset of predictors
    current_predictors <- predictors[1:i]
    formula <- as.formula(paste("high_use ~", paste(current_predictors, collapse = "+")))
    set.seed(123)  # for reproducibility
    fitControl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
    model <- train(formula, data = aa_alc, method = "glm", family = "binomial", trControl = fitControl)
    
    # Store training and testing errors
    training_errors <- c(training_errors, mean(model$results$Accuracy))
    testing_errors <- c(testing_errors, 1 - max(model$results$Accuracy))# Replace 'Accuracy' with the appropriate metric
    number_of_predictors <- c(number_of_predictors, i)
    model_summaries[[i]] <- model$coefnames
}

results_df <- data.frame(
    Number_of_Predictors = number_of_predictors,
    Training_Error = training_errors,
    Testing_Error = testing_errors
)

# Plotting
ggplot(results_df, aes(x = Number_of_Predictors)) +
    geom_line(aes(y = Training_Error, colour = "Training Error")) +
    geom_line(aes(y = Testing_Error, colour = "Testing Error")) +
    labs(title = "Training and Testing Errors by Number of Predictors",
         x = "Number of Predictors",
         y = "Error") +
    theme_minimal()

That makes the model better overall, with the minimum error being:

## [1] 0.2454599

reached when the guardian variable is added:

predictors[20]
## [1] "guardian"

This shows that increasing the number of predictors indiscriminately hurts the model, but adding valuable predictors improves it.


4: Clustering and classification

4.1.1: Loading necessary libraries

library(MASS)  # For the Boston dataset
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(ggplot2)  # For graphical representations
library(caret)  # For data splitting and preprocessing
library(corrplot)
## corrplot 0.92 loaded

4.1.2: Loading the data and investigating the data structure

# Step 1: Load and Explore the Boston Dataset
data("Boston")
# Exploring the structure and dimensions of the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

This dataset contains housing values in the suburbs of Boston. It has 506 rows and 14 columns, all numerical; chas can be considered categorical, as it only takes the values 0 and 1. Each row represents a suburb, and the columns are features such as the crime rate, the number of rooms, and the age of the housing. A more complete description can be found here

The variables have the following descriptions:

  • CRIM - per capita crime rate by town
  • ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS - proportion of non-retail business acres per town.
  • CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  • NOX - nitric oxides concentration (parts per 10 million)
  • RM - average number of rooms per dwelling
  • AGE - proportion of owner-occupied units built prior to 1940
  • DIS - weighted distances to five Boston employment centres
  • RAD - index of accessibility to radial highways
  • TAX - full-value property-tax rate per $10,000
  • PTRATIO - pupil-teacher ratio by town
  • B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • LSTAT - % lower status of the population
  • MEDV - Median value of owner-occupied homes in $1000’s
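Since chas is the only variable that behaves like a category, it could also be recoded as a factor (a small sketch; the no_river/river labels are my own):

```r
# Sketch: recoding chas as a factor (labels are my own choice).
library(MASS)
data("Boston")
Boston$chas <- factor(Boston$chas, levels = c(0, 1),
                      labels = c("no_river", "river"))
table(Boston$chas)  # 35 of the 506 tracts bound the river
```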

4.1.3: Graphical Overview and Summary of Variables

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The summary statistics of the variables are on very different scales and not directly comparable, as is expected for real-world data. We can now examine the relations between the variables to look deeper into the data.

  • The distributions are skewed: nox and dis are skewed towards low values, age towards high values. The proportion of non-retail business acres per town (indus) has a bimodal distribution, as does the tax variable.

  • Almost all the variables appear to be correlated. Only 8 pairs are not significantly correlated, all involving the “categorical” Charles River variable. Most of the correlations make sense: e.g. nox concentration and the distance to the employment centres are negatively correlated, while nox and industrialisation are positively correlated.

  • The crime rate has a statistically significant correlation with all variables except chas: it correlates negatively with 5 variables and positively with 7.
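The significance of any one of these correlations can be checked directly with cor.test; a sketch using the nox–dis pair mentioned above:

```r
# Sketch: testing the significance of the nox-dis correlation.
library(MASS)
data("Boston")
ct <- cor.test(Boston$nox, Boston$dis)
ct$estimate  # strongly negative (about -0.77)
ct$p.value   # far below 0.05
```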

We can plot the correlation matrix in a more visually pleasing way:

cor_matrix <- cor(Boston) 

corrplot(cor_matrix)

All the relations between the variables can be seen clearly in this figure.

4.1.4: Standardising the dataset

As the data describe very different phenomena, the values are not directly comparable: e.g. the mean of tax is 408, while the mean of nox is ~0.55. To eliminate this, we can standardise the data.

scaled_Boston <- as.data.frame(scale(Boston))
summary(scaled_Boston)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

Now all the means are the same: 0. This standardisation makes the variables more comparable and often improves machine learning model performance.
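As a sanity check, scale() does exactly the manual z-score transformation: each column has its mean subtracted and is then divided by its standard deviation.

```r
# Sanity check (sketch): scale() is the z-score (x - mean(x)) / sd(x),
# applied column by column.
library(MASS)
data("Boston")
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
scaled <- as.numeric(scale(Boston)[, "crim"])
all.equal(scaled, manual)  # TRUE
```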

4.1.4.1: Creating a categorical variable from scaled crime rate

Using quantiles as breakpoints, and removing the old crime rate variable.

quantiles <- quantile(scaled_Boston[, "crim"], probs = c(0, 0.25, 0.5, 0.75, 1))
scaled_Boston$categorical_crim <- cut(scaled_Boston[, "crim"], breaks = quantiles, 
                               include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
scaled_Boston <- scaled_Boston[, -which(names(scaled_Boston) == "crim")]

4.1.4.2: Splitting the Dataset into Train and Test Sets

80% of the data is now in the training set, and the remaining 20% is in the test set.

set.seed(123) # For reproducibility
indexes <- createDataPartition(scaled_Boston$categorical_crim, p = 0.8, list = FALSE)
train_set <- scaled_Boston[indexes, ]
test_set <- scaled_Boston[-indexes, ]
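For comparison, a plain base-R split could be sketched like this; unlike createDataPartition above, sample() does not stratify by the class variable:

```r
# Sketch: a simple random 80/20 split with base R (no stratification).
library(MASS)
data("Boston")
set.seed(123)
n   <- nrow(Boston)
idx <- sample(n, size = round(0.8 * n))
train_alt <- Boston[idx, ]
test_alt  <- Boston[-idx, ]
c(nrow(train_alt), nrow(test_alt))  # 405 101
```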

4.1.5: Fitting the linear discriminant analysis on the train set

# Loading necessary library
library(MASS)  # For lda function
library(ggplot2)  # For creating biplot

# Step 1: Fit the Linear Discriminant Analysis Model
# Fitting LDA with categorical crime rate as target variable
lda_model <- lda(categorical_crim ~ ., data = train_set)

# Step 2: Summary of the LDA Model
lda_model
## Call:
## lda(categorical_crim ~ ., data = train_set)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2512315 0.2487685 0.2487685 0.2512315 
## 
## Group means:
##                  zn      indus        chas        nox         rm        age
## low       0.9279155 -0.9002730 -0.19513102 -0.8740789  0.4510628 -0.8867213
## med_low  -0.0913696 -0.2911617 -0.03844192 -0.5807849 -0.1223160 -0.3793710
## med_high -0.3836558  0.1931104  0.15646403  0.3930228  0.1140281  0.4230461
## high     -0.4872402  1.0149946 -0.07933396  1.0662955 -0.4222547  0.8161999
##                 dis        rad        tax     ptratio       black       lstat
## low       0.8428080 -0.6789912 -0.7285408 -0.39200564  0.37567764 -0.77093834
## med_low   0.3597389 -0.5361295 -0.4615109 -0.03982856  0.31663833 -0.16048504
## med_high -0.3650188 -0.4542586 -0.3302709 -0.28998920  0.08434782 -0.06173942
## high     -0.8559422  1.6596029  1.5294129  0.80577843 -0.76539336  0.90035435
##                medv
## low       0.5324073
## med_low  -0.0140094
## med_high  0.1977447
## high     -0.7285395
## 
## Coefficients of linear discriminants:
##                 LD1          LD2        LD3
## zn       0.11143638  0.643005769 -0.9943789
## indus    0.08740548 -0.227273609  0.2772096
## chas     0.01766274 -0.058628846  0.1539449
## nox      0.18483474 -0.897262214 -1.3248989
## rm      -0.01332317 -0.033607221 -0.1228014
## age      0.25569474 -0.405630630 -0.1938632
## dis     -0.12730753 -0.346775059  0.0996160
## rad      4.07862012  0.908418329 -0.1089458
## tax      0.06001905  0.008407111  0.5321260
## ptratio  0.15504095  0.072786263 -0.3434405
## black   -0.10715634  0.034155460  0.1240672
## lstat    0.14772261 -0.131485744  0.4522005
## medv     0.04936251 -0.364822504 -0.2088124
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9637 0.0274 0.0088
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
# The biplot visualizes the linear discriminants (LD1 and LD2) and shows
# how the observations in the training set are separated based on the
# categorical crime rate.
classes <- as.numeric(train_set$categorical_crim)

plot(lda_model, dimen = 2, col = classes,pch = classes)
lda.arrows(lda_model, myscale = 2)

LD1 is influenced mostly by rad; LD2 is influenced by zn and nox with similar magnitude but opposite direction.

4.1.6: Predicting crime categories on the test set

# Step 6: Save Crime Categories and Update Test Set
# Saving the crime categories from the test set
test_crime_categories <- test_set$categorical_crim

# Removing the categorical crime variable from the test set
test_set <- test_set[,-which(names(test_set) == "categorical_crim")]

# Step 7: LDA Model Prediction
# Fit LDA model on the training set
library(MASS) # LDA is in the MASS package
fit_lda <- lda(categorical_crim ~ ., data = train_set)

# Predicting on the test set
predictions <- predict(fit_lda, newdata = test_set)

Cross-tabulating the predictions against the true categories with table and addmargins:

table(Predicted = predictions$class, Actual = test_crime_categories) %>% 
  addmargins()
##           Actual
## Predicted  low med_low med_high high Sum
##   low       17       3        1    0  21
##   med_low    8      15        4    0  27
##   med_high   0       7       17    1  25
##   high       0       0        3   24  27
##   Sum       25      25       25   25 100

The model is reasonably good, especially at the high and low values, and can be used to differentiate high and low groups. The middle groups become muddled.
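The overall accuracy implied by the table above can be computed directly (values copied from the printed confusion matrix):

```r
# Overall accuracy from the confusion matrix above
# (rows = predicted, columns = actual).
conf <- matrix(c(17,  3,  1,  0,
                  8, 15,  4,  0,
                  0,  7, 17,  1,
                  0,  0,  3, 24),
               nrow = 4, byrow = TRUE,
               dimnames = list(Predicted = c("low", "med_low", "med_high", "high"),
                               Actual    = c("low", "med_low", "med_high", "high")))
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # 0.73
```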

4.1.7: K-means clustering

Reloading the dataset, scaling and calculating the distances:

data(Boston)

Boston_scaled <- data.frame(scale(Boston))

dist_euc <- dist(Boston_scaled)
summary(dist_euc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

The mean distance is ~4.9. To determine the optimal number of clusters, we look at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes: we perform the clustering with different values of k and see what happens to the WCSS as k increases. A value of k after which the WCSS stops dropping rapidly is a good candidate, but the larger k is, the harder the results may be to interpret. So we check and plot k from 1 to 20, just to see what happens.

k_max <- 20

twcs <- sapply(1:k_max, function(k){kmeans(Boston_scaled, k)$tot.withinss})

elbow_data <- data.frame(
  k = 1:k_max,
  twcs = twcs
)
elbow_plot <- ggplot(elbow_data, aes(x = k, y = twcs)) +
  geom_line() +  # Connect points with lines
  geom_point() +  # Add points
  scale_x_continuous(breaks = 1:k_max) +  # Ensure k values are treated as continuous
  labs(x = "Number of Clusters (k)", y = "Total Within-Cluster Sum of Squares", 
       title = "Determining Optimal k") +
  theme_minimal()

elbow_plot

The slope of the plot changes most sharply at k = 2, so we can use this clustering for the further analysis.

Boston_scaled_km <- kmeans(Boston_scaled, 2)
pairs(Boston_scaled,col = Boston_scaled_km$cluster)

pairs(Boston_scaled[,c(1,10)],col = Boston_scaled_km$cluster)

There seems to be good cluster separation between crime and tax.

pairs(Boston_scaled[,c(3,5)],col = Boston_scaled_km$cluster)

An even better separation for indus and nox: one cluster with high indus and high nox, and one cluster with both values low.

pairs(Boston_scaled[,c(2,3)],col = Boston_scaled_km$cluster)

There also seems to be good separation for the indus and zn variables, with two clusters: low indus with high zn, and vice versa. As we saw earlier, the chosen variables are highly correlated, and logically should be, so these results make sense.
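As a quick check that the two clusters really track the crime rate, the cluster assignment can be cross-tabulated against a high/low split of crim (a sketch re-running the same k = 2 clustering with a fixed seed, since kmeans starts from random centres):

```r
# Sketch: do the two clusters track the crime rate?
library(MASS)
data("Boston")
Boston_scaled <- data.frame(scale(Boston))
set.seed(123)
km <- kmeans(Boston_scaled, centers = 2)
high_crime <- Boston$crim > median(Boston$crim)
table(cluster = km$cluster, high_crime)
```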

4.1.8*: Bonus. Visualising clustering results with a biplot:

The elbow plot shows that the last clear decrease in WCSS is at k = 6, so I decided to look at what the k = 6 clustering looks like.

data("Boston")

Boston_scaled <- as.data.frame(scale(Boston))


Boston_scaled_km <- kmeans(Boston_scaled, centers = 6)

Redoing the LDA using the cluster as the target classes.

lda.fit <- lda(Boston_scaled_km$cluster ~ . , data = Boston_scaled)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(Boston_scaled_km$cluster)

# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2,col = classes,pch = classes)
lda.arrows(lda.fit, myscale = 3)

As we can see, cluster 2 is separate from all the others, and the Charles River variable is the main determinant for this cluster. Clusters 4 and 5 are separate from the others; this separation depends on the radial highways variable, and cluster 5 seems the most removed from the main mass.

4.1.9*: Bonus. Projecting the training data used to fit the LDA, and a 3D visualisation:

Installing/loading plotly

if (!require(plotly)) {
  install.packages("plotly")
}
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(plotly)
lda.fit <- lda(categorical_crim ~ . , data = train_set)


model_predictors <- dplyr::select(train_set, -categorical_crim)
# check the dimensions
dim(model_predictors)
## [1] 406  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, color = train_set$categorical_crim, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, color = factor(Boston_scaled_km$cluster[indexes]), type= 'scatter3d', mode='markers')

The crime colouring shows that the high-crime group clusters separately, together with some med_high observations. The k-means colouring shows clusters 4 and 5 together, as in the 2D plane, but the 6th cluster is split between the two large clusters.

4.2 Data wrangling for the next week’s data!

The R script transforming the data for next week’s assignment is in the repository, with the human.csv file in the same directory.


5: Dimensionality reduction techniques


6: Analysis of longitudinal data